  Re: [announce] JITC: Really fast POVRay FPU  
From: Wolfgang Wieser
Date: 8 Aug 2004 13:51:21
Message: <41166817@news.povray.org>
Thorsten Froehlich wrote:
> In article <41163f3c$1@news.povray.org> , Nicolas Calimet
> <pov### [at] freefr>  wrote:
>>  Interesting.  But also surprising (to me).
>>  Could you explain why it takes an order of magnitude longer to
>> jump to a function via a pointer as compared to a direct reference ?
>> (note: I'm not very knowledgeable in low-level programming, just a
>> tiny idea of some assembly instructions).
> 
> It should not at all.
> 
Hmm... After this question, I looked further into the issue. 

First of all, quoting the GCC info: 
---------------------------------------------------------------------
Note that you will still be paying the penalty for the call through a
function pointer; on most modern architectures, such a call defeats the
branch prediction features of the CPU.  This is also true of normal
virtual function calls.
---------------------------------------------------------------------

But this cannot account for the huge difference I measured. 
And actually, my second posting on the issue must be considered partly wrong 
as well, because it turns out that GCC will now also inline functions which 
are declared extern _and_ appear further down in the code than the 
calling location -- even when marked with __attribute__((noinline)) !!
[GCC 3.4.2 20040724 (prerelease); Seems I need to file a bug report...]

And since I did not verify that these 3 precautions actually prevented 
the compiler from inlining the code, what I really measured was the 
time difference between an extern and an inlined call, which clearly 
yields a difference in speed. 
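
For anyone repeating the test, here is a minimal sketch of one way to 
keep the call alive. It assumes GCC (or a compatible compiler such as 
Clang). The GCC manual suggests putting an empty asm statement into a 
noinline function so that calls to it are not optimized away as free of 
side effects. The names demo_foo and call_repeatedly are hypothetical 
and are not part of the attached test code. 

```cpp
#include <cassert>

// Hypothetical stand-in for int_foo(). The empty asm statement acts as
// an artificial side effect so the optimizer cannot drop the call.
extern "C" __attribute__((noinline)) double demo_foo(double x)
{
    asm volatile("");
    return x;
}

// Calls fn(44.0) count times through a volatile function pointer.
// Because the pointer is volatile, the compiler must reload it on every
// iteration and cannot turn the indirect call into a direct or inlined one.
double call_repeatedly(double (*fn)(double), long count)
{
    assert(fn != 0);
    double (*volatile fp)(double) = fn;
    double last = 0.0;
    for (long i = 0; i < count; i++)
        last = fp(44.0);
    return last;
}
```

Checking the output of "g++ -S" for the call instruction, as done 
further down in this posting, is still the only reliable confirmation. 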

Okay, so let's do some really clean benchmarks this time - finally. 

Oh dear. Could somebody perhaps run some independent tests on this 
issue? Because I will now tell you that calling a function in an 
external library is actually _faster_ than calling it directly in the 
code when certain compiler flags are used. 
I attached my test code for review. 

So here are the timings: 

Function call      | OPT1  | OPT2
-------------------+-------+-------
int_foo(44.0);     | 3.95s | 3.58s
(*int_fooP)(44.0); | 3.57s | 3.46s
(*ext_fooP)(44.0); | 3.57s | 4.13s
-none-             | 0.37s | 0.37s

OPT1 = -ffast-math -O2 -fno-rtti
OPT2 = -ffast-math -O2 -fno-rtti -march=athlon-xp

All these values were measured repeatedly and are reproducible to within 
+-1 in the last digit given, so the differences are significant. 

Hence, I think we can conclude that there is no overhead for a 
dynamically linked external library function call. 
[At least until somebody proves that something went wrong... :| ]
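
For independent tests without the "time" command, the following is a 
rough sketch of an in-process timing harness. It assumes a C++11 
compiler and GCC-style attributes, and it leaves out the dlopen() case, 
so it only compares a direct call, a call through a pointer and an empty 
loop inside one binary. All names in it (timed_foo, seconds_for, 
LoopTimings, measure) are made up for illustration. 

```cpp
#include <cassert>
#include <chrono>

// Hypothetical callee; noinline plus an empty asm keeps the call alive.
extern "C" __attribute__((noinline)) double timed_foo(double x)
{
    asm volatile("");
    return x;
}

// Runs call() the given number of times and returns elapsed seconds.
template <class Call>
double seconds_for(long iterations, Call call)
{
    assert(iterations >= 0);
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++)
        call();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

struct LoopTimings { double direct, via_pointer, empty; };

// Times the three loop variants with the same iteration count.
LoopTimings measure(long iterations)
{
    // volatile: forces a real indirect call on every iteration.
    double (*volatile fp)(double) = &timed_foo;
    LoopTimings t;
    t.direct      = seconds_for(iterations, [] { timed_foo(44.0); });
    t.via_pointer = seconds_for(iterations, [&] { fp(44.0); });
    // The empty asm keeps the otherwise empty loop from being removed.
    t.empty       = seconds_for(iterations, [] { asm volatile(""); });
    return t;
}
```

Calling measure(0xfffffff) would use the same iteration count as the 
attached test program (268435455 iterations). With that count, the OPT1 
figures above (3.57s minus 0.37s for the empty loop) work out to roughly 
12 ns per call. 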

I also verified the case where the external library is calling back 
into the main code: There is no real difference again. 

Wolfgang

Here are the generated assembler instructions in all measured 
cases: 

----------<OPT1>------------<*ext_fooP>---------<OPT2>----------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<*int_fooP>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    *%esi
    call    *%esi                |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
----------------------------<int_foo()>-------------------------------
.L7:                             |  .L7:
    movl    $0, (%esp)           |      movl    $0, (%esp)
    movl    $1078329344, %eax    |      movl    $1078329344, 4(%esp)
    movl    %eax, 4(%esp)        |      call    int_foo
    call    int_foo              |      ffreep  %st(0)
    fstp    %st(0)               |      decl    %ebx
    decl    %ebx                 |      jns .L7
    jns .L7                      |
-----------------------------<-none->---------------------------------
.L7:                             |  .L7:
    decl    %eax                 |      decl    %eax
    jns .L7                      |      jns .L7
---------------------------------^------------------------------------
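
A note on reading the listings: the two movl instructions store the 
argument 44.0 on the stack as two 32-bit words. Here is a small sketch 
that reproduces the constant, assuming 64-bit IEEE 754 doubles (which is 
what x86 uses). The helper name split_double is hypothetical. 

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct DoubleWords { std::uint32_t low, high; };

// Splits the bit pattern of a double into its low and high 32-bit words.
// On little-endian x86 the low word is stored at (%esp) and the high
// word at 4(%esp).
DoubleWords split_double(double value)
{
    assert(sizeof(double) == sizeof(std::uint64_t));
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    DoubleWords w;
    w.low  = static_cast<std::uint32_t>(bits & 0xffffffffu);
    w.high = static_cast<std::uint32_t>(bits >> 32);
    return w;
}
```

For 44.0 this gives low = 0 and high = 1078329344 (0x40460000), which 
are exactly the immediates in the listings. So all three call variants 
pass the same argument, and the loops differ only in the call 
instruction itself. 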

Here are the test programs: 

---<Makefile>---------------------------------------------------------
MAINFLAGS = -ffast-math -O2 -fno-rtti
LIBFLAGS = -ffast-math -O2 -fno-rtti
#MAINFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp
#LIBFLAGS = -ffast-math -O2 -fno-rtti -march=athlon-xp

all:
    g++ $(MAINFLAGS) -DMODULE=0 -DMAIN -c dl.cc -o dl.o
    g++ $(MAINFLAGS) -DMODULE=0 -DFOO -c dl.cc -o foo.o
    g++ $(MAINFLAGS) -o test dl.o foo.o -rdynamic -ldl -lm
    g++ $(LIBFLAGS) -nostartfiles -shared -DMODULE=1 dl.cc -o foo.so
    time ./test

asm:
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DMAIN -S dl.cc -o dl.S
    gcc $(MAINFLAGS) -fno-exceptions -DMODULE=0 -DFOO -S dl.cc -o foo.S
------------------------------------------------------------------------

---<dl.cc>--------------------------------------------------------------
// dl.cc - Written by Wolfgang Wieser. 

#if MODULE==0
//------------------
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

extern "C" double int_foo(double x) __attribute__((noinline));

#ifdef FOO
double int_foo(double x)
{
    //fprintf(stderr,"int_foo\n");
    return(x);
}
#endif  // FOO

#ifdef MAIN
int main()
{
    void *hdl=dlopen("./foo.so",RTLD_NOW | RTLD_LOCAL);
    if(!hdl)
    {  fprintf(stderr,"dlopen: %s\n",dlerror());  exit(1);  }
    
    dlerror();
    void *sym=dlsym(hdl,"ext_foo");
    const char *err;
    if((err=dlerror()))
    {  fprintf(stderr,"dlsym: %s\n",err);  exit(1);  }
    double (*ext_fooP)(double)=(double (*)(double))sym;
    
    double (*int_fooP)(double)=&int_foo;
    
    // These calls make the assembler output easier to compare because 
    // they prevent the function pointers from getting optimized away 
    // as "unneeded variables". 
    int_foo(23.0);
    (*ext_fooP)(23.0);
    (*int_fooP)(23.0);
    
    for(int i=0; i<0xfffffff; i++)
    {
        //int_foo(44.0);
        //(*int_fooP)(44.0);
        (*ext_fooP)(44.0);
    }
    
    return(0);
}
#endif  // MAIN

#else  // MODULE!=0
//------------------
#include <stdio.h>

extern "C" double ext_foo(double x)
{
    //fprintf(stderr,"ext_foo\n");
    return(x);
}
#endif
------------------------------------------------------------------------

